feat: add ccl registry, fix profiler #109
Merged
kilinchange merged 6 commits into master, Mar 5, 2026
kilinchange
requested changes
Mar 2, 2026
…acros, mv unique_id file helper functions to utils
85c777f to
c289934
JYMiracle305
approved these changes
Mar 5, 2026
kilinchange
approved these changes
Mar 5, 2026
Collaborator
Please post screenshots of the test results, and record the performance numbers in the Feishu spreadsheet.
Contributor
Author

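Moved the original DeviceGuard and its related definitions into the core/runtime folder.
a. Added several runtime interfaces to DeviceGuard, including the stream/event-related ones.
b. Grouped the blas_handle/stream/event definitions together and moved them into runtime_common.h.
c. Added a RuntimeStatus definition; the plan is to later make every interface return RuntimeStatus, in line with the CUDA runtime API.
d. Updated the include paths in all affected files accordingly.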
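Removed the platform-specific macros from the Profiler code; it now goes through the runtime API provided by DeviceGuardImpl. Also tracked down the Profiler error seen in distributed runs with vpp: it was caused by a multi-threaded read/write conflict, and adding a mutex fixed it.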
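To illustrate the kind of fix described above, here is a minimal sketch of a mutex-guarded record store. `ProfilerSketch`, `Record`, `Size`, and `RunConcurrentRecords` are hypothetical names invented for this example; the actual Profiler class and the exact data it protects are not shown in this PR description.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the profiler; not the project's actual class.
class ProfilerSketch {
public:
    // Appends one record. The lock serializes concurrent writers.
    void Record(const std::string &name, double elapsed_ms) {
        std::lock_guard<std::mutex> lock(mutex_);
        records_.push_back(Entry{name, elapsed_ms});
    }

    // Readers take the same lock so they never observe a half-updated vector.
    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return records_.size();
    }

private:
    struct Entry {
        std::string name;
        double elapsed_ms;
    };

    mutable std::mutex mutex_;
    std::vector<Entry> records_;
};

// Spawns num_threads writers that each add per_thread records,
// then returns how many records were stored in total.
inline std::size_t RunConcurrentRecords(int num_threads, int per_thread) {
    assert(num_threads >= 0 && per_thread >= 0);
    ProfilerSketch profiler;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&profiler, per_thread, t] {
            for (int i = 0; i < per_thread; ++i) {
                profiler.Record("op_" + std::to_string(t), 0.0);
            }
        });
    }
    for (auto &worker : workers) {
        worker.join();
    }
    return profiler.Size();
}
```

Without the lock, concurrent `push_back` calls on the same vector are a data race; with it, every record from every thread is kept.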
====== The parts below relate to the communication library / distributed support ======
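Created the core/ccl/ directory and added CclGroupGuard, CclImpl, and CclImplRegistry (the counterparts of DeviceGuard, DeviceGuardImpl, and DeviceGuardImplRegistry):
a. CclImplRegistry registers the backend communication library for each platform.
b. CclImpl covers all interface definitions related to the communication library.
c. CclGroupGuard uses an RAII scope to automatically wrap a region similar to the one between ncclGroupStart() and ncclGroupEnd().
d. ccl_common.h holds the comm/unique_id/ccl_status definitions.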
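The relationship between the three classes listed above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the class names come from the description, but the method names (`Register`, `Get`, `GroupStart`, `GroupEnd`), the `DeviceType` enumerators, and the `FakeCclImpl` logging backend are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class DeviceType { kCPU, kCUDA };  // assumed enumerators

// Hypothetical subset of the backend interface.
class CclImpl {
public:
    virtual ~CclImpl() = default;
    virtual void GroupStart() = 0;
    virtual void GroupEnd() = 0;
};

// Maps a device type to the communication backend registered for it.
class CclImplRegistry {
public:
    static CclImplRegistry &Instance() {
        static CclImplRegistry registry;
        return registry;
    }
    void Register(DeviceType type, std::unique_ptr<CclImpl> impl) {
        impls_[type] = std::move(impl);
    }
    CclImpl *Get(DeviceType type) const {
        auto it = impls_.find(type);
        if (it == impls_.end()) {
            throw std::runtime_error("no CCL backend registered for this device type");
        }
        return it->second.get();
    }

private:
    std::map<DeviceType, std::unique_ptr<CclImpl>> impls_;
};

// RAII scope: GroupStart() on entry, GroupEnd() on exit.
class CclGroupGuard {
public:
    explicit CclGroupGuard(DeviceType type)
        : impl_(CclImplRegistry::Instance().Get(type)) {
        impl_->GroupStart();
    }
    ~CclGroupGuard() { impl_->GroupEnd(); }
    CclGroupGuard(const CclGroupGuard &) = delete;
    CclGroupGuard &operator=(const CclGroupGuard &) = delete;

private:
    CclImpl *impl_;
};

// Fake backend that only logs calls, so the sketch runs without NCCL.
class FakeCclImpl : public CclImpl {
public:
    explicit FakeCclImpl(std::shared_ptr<std::vector<std::string>> log)
        : log_(std::move(log)) {
        assert(log_ != nullptr);
    }
    void GroupStart() override { log_->push_back("start"); }
    void GroupEnd() override { log_->push_back("end"); }

private:
    std::shared_ptr<std::vector<std::string>> log_;
};

// Registers the fake backend, runs one guarded region, returns the call log.
inline std::vector<std::string> RunGroupDemo() {
    auto log = std::make_shared<std::vector<std::string>>();
    CclImplRegistry::Instance().Register(DeviceType::kCUDA,
                                         std::make_unique<FakeCclImpl>(log));
    {
        CclGroupGuard guard(DeviceType::kCUDA);
        log->push_back("collective");  // stands in for grouped collective calls
    }
    return *log;
}
```

The guard makes the end call impossible to forget, including on early returns, which is the usual reason to prefer a scope object over a manual start/end pair.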
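Removed the platform-specific code from ProcessGroup and Work; they now go through the communication-library API provided by CclImpl. A few details:
a. ProcessGroup gains a DeviceType member named backend, which is used to look up the corresponding CclImpl and call the registered platform's communication-library API.
b. Added a ProcessGroupFactory::Instance(DeviceType) interface, so creating or getting the factory requires passing a backend argument. The original no-argument ProcessGroupFactory::Instance() now behaves like a const accessor: it only returns a factory that has already been initialized with some backend and never creates one. The actual static instance is now declared at global scope.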
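A rough sketch of the two `Instance` overloads described in item b is shown below. Only the two signatures and the "get but never create" semantics come from the description. The behavior when the factory is uninitialized (throwing), the check on a mismatched backend (an `assert`), the `DeviceType` enumerators, and the `g_factory` / `g_factory_mutex` globals are assumptions made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <stdexcept>

enum class DeviceType { kCPU, kCUDA };  // assumed enumerators

class ProcessGroupFactory {
public:
    // Creates the factory on first use, or returns the existing one.
    static ProcessGroupFactory *Instance(DeviceType backend);
    // Only returns an already-initialized factory; never creates one.
    static ProcessGroupFactory *Instance();

    DeviceType backend() const { return backend_; }

private:
    explicit ProcessGroupFactory(DeviceType backend) : backend_(backend) {}
    DeviceType backend_;
};

namespace {
// The single instance lives at namespace scope rather than inside a function.
std::mutex g_factory_mutex;
std::unique_ptr<ProcessGroupFactory> g_factory;
}  // namespace

ProcessGroupFactory *ProcessGroupFactory::Instance(DeviceType backend) {
    std::lock_guard<std::mutex> lock(g_factory_mutex);
    if (!g_factory) {
        g_factory.reset(new ProcessGroupFactory(backend));
    }
    // Assumption: asking for a different backend later is a programming error.
    assert(g_factory->backend() == backend);
    return g_factory.get();
}

ProcessGroupFactory *ProcessGroupFactory::Instance() {
    std::lock_guard<std::mutex> lock(g_factory_mutex);
    if (!g_factory) {
        // Assumption: the real code may fail differently (e.g. a CHECK macro).
        throw std::logic_error("ProcessGroupFactory has not been initialized with a backend");
    }
    return g_factory.get();
}
```

Keeping the instance at namespace scope lets both overloads share it, which a function-local static inside one overload would not allow.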